This vignette will demonstrate the integration of spectral flow cytometry and CyTOF protein expression measurements using cyCombine.


First, let us present the two datasets we will integrate:

As an example of spectral flow cytometry data, we will use the dataset from Park et al. (2020). The data is based on human PBMCs, and contains four samples, which we downloaded and pre-gated to live single cells.

For the CyTOF data, we use the data from a single healthy donor processed at the Human Immune Monitoring Center. The sample was derived from FlowRepository https://flowrepository.org/id/FR-FCM-ZYAJ and pre-gated to live intact singlets. The data was subsequently manually gated to populations of interest.

We start by loading some packages

library(cyCombine)
library(tidyverse)
library(Seurat)




Loading data

Spectral flow cytometry data

We are now ready to load the spectral flow data. We have set up a panel file in csv format, so the correct information is extractable from there. Let us have a look at the contents:

# Directory with raw .fcs files
data_dir <- "Park_et_al_2020_Live+/"

# Panel and reading data
sfc_panel <- read_csv(paste0(data_dir, "/panel_Park2020.csv"))
#> 
#> ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> cols(
#>   Channel = col_character(),
#>   Marker = col_character(),
#>   Type = col_character()
#> )
sfc_panel


We then progress with reading the spectral flow dataset and converting it to a tibble format, which is easy to process. We use cofactor = 6000 for spectral flow data asinh-transformation. This looks reasonable in terms of separating negative and positive peaks in the data, and it was further recommended by Ferrer-Font et al. (2020).

sfc_markers <- sfc_panel %>% 
  filter(Type != "none") %>% 
  pull(Marker) %>% 
  str_remove_all("[ _-]")

spectral <- prepare_data(data_dir = data_dir,
                         metadata = paste0(data_dir, "/Spectral samples cohort.xlsx"),
                         filename_col = "FCS_name",
                         batch_ids = "Batch",
                         condition = "Set",
                         cofactor = 6000,
                         markers = sfc_markers,
                         down_sample = FALSE)
#> Reading 4 files to a flowSet..
#> Extracting expression data..
#> Your flowset is now converted into a dataframe.
#> Transforming data using asinh with a cofactor of 6000..
#> Done!


# We get rid of any negative values in the flow data using a linear shift - essentially these don't matter but it makes sense because cyCombine uses a cap of 0 for batch correction
spectral <- linear_shift(spectral, .keep = T)
#> Linearly shifting data..



Mass cytometry data

Then it is time to read the CyTOF data. In this case, it’s already in a tibble format - but refer to prepare_data as shown above, for this.

### Get HIMC cytof data for one sample
load('Sample001_pregated.Rdata')



Processing datasets - batch correction

Next, we need to figure out which markers to use when integrating the two datasets. Here, it is important to keep a couple of things in mind.

First, be careful with protein names - i.e. CD197 and CCR7 are actually the same protein, but each name is relatively commonly used. Make sure to correct the column names in either set, so it matches the other, i.e. using colnames(cytof)[colnames(cytof)=='CD197'] <- 'CCR7'. In this example, there are a total of 26 overlapping markers between the two datasets.

colnames(cytof)[colnames(cytof)=='CD197'] <- 'CCR7'
colnames(cytof)[colnames(cytof)=='CD8a'] <- 'CD8'

cytof_markers <- get_markers(cytof)
intersect(sfc_markers, cytof_markers)
#>  [1] "CD38"   "CD27"   "CD127"  "CD45RA" "CD16"   "CD56"   "CD8"    "CCR7"  
#>  [9] "IgD"    "CD3"    "CD28"   "CCR6"   "CXCR5"  "PD1"    "CD4"    "CD57"  
#> [17] "CD24"   "CD25"   "CXCR3"  "HLADR"  "CD20"   "TCRgd"  "CD14"   "CD19"  
#> [25] "CD123"  "CD11c"


Now, we are ready to combine the two datasets on the overlapping columns.

# Get overlapping column names
overlap_cols <- intersect(colnames(spectral), colnames(cytof))

# Make one tibble
uncorrected <- bind_rows(cytof[,overlap_cols], spectral[,overlap_cols]) %>%
  mutate(id = 1:(nrow(cytof)+nrow(spectral)))

And now, batch correction can be performed with cyCombine.

# Run batch correction
corrected <- uncorrected %>%
  batch_correct(xdim = 8,
                ydim = 8,
                norm_method = 'rank',
                ties.method = 'average')

# Add manually assigned labels back
# corrected$label <- uncorrected$label



Evaluating performance

We now evaluate the correction using EMD - each marker is evaluated across all cells.

corrected$label <- uncorrected$label <- 1

emd <- evaluate_emd(preprocessed = uncorrected,
             corrected = corrected)

emd$violin

emd$scatterplot


Finally, let us look at some plots to visualize the correction. First, the marker distributions before and after:

plot_density(uncorrected, corrected)

Finally, some UMAPs to visualize the correction. I will downsample to 10,000 cells from each dataset so it is easier to see what is going on.


inds <- split(1:length(uncorrected$batch), uncorrected$batch)
sample <- unlist(lapply(inds, sample, 10000))

plot1 <- plot_dimred(uncorrected[sample,], name = 'Uncorrected', type = 'umap')
plot2 <- plot_dimred(corrected[sample,], name = 'Corrected', type = 'umap')

plot1

plot2



Imputing non-overlapping markers

Now, we could say that the integration is done. We have an strong idea about which cells belong in the same populations across the two technologies. But we may really curious about the non-overlapping channels. E.g. let us have a brief look on the immunoglobulins (IgD, IgM, IgG). All of these are measured in the spectral flow dataset - only IgD is included in the CyTOF data. These three proteins should also be found only among the B-cells (CD19+/CD20+). This should be true even after imputation.

We will impute all the markers missing in either dataset first.

# Get the dataset-unique markers
spectral_unique <- sfc_markers[!sfc_markers %in% overlap_cols]
cytof_unique <- cytof_markers[!cytof_markers %in% overlap_cols]


# Add non-overlapping markers back to corrected data frames
cytof_corrected <- bind_cols(corrected[corrected$batch == 'CyTOF',], cytof[,cytof_unique])
spectral_corrected <- bind_cols(corrected[corrected$batch == 'Spectral',], spectral[,spectral_unique])


# Imputation for whole panels
imputed <- impute_across_panels(dataset1 = cytof_corrected, 
                                dataset2 = spectral_corrected,
                                overlap_channels = intersect(sfc_markers, cytof_markers), 
                                impute_channels1 = spectral_unique,
                                impute_channels2 = cytof_unique)
#> Warning in create_fsom(., markers = overlap_channels, xdim = 5, ydim = 5): This
#> function is deprecated. Please use 'create_som(som_type = 'fsom')' instead.
#> Mapping data to SOM
#> [1] "Performing density draws for dataset1"
#> [1] "Performing density draws for dataset2"


Now that we have imputed the values, we have a chance to view the IgG / IgM expression for both datasets. We will look at this along with the CD19, CD20, and IgD expression.

# Let us make the UMAP for the corrected set (based on overlapping markers only)
umap_plot <- plot_dimred(corrected[sample,], name = 'Corrected', type = 'umap', plot = 'CD19', return_coord = T)

# We can view the plot colored by CD19 expression
umap_plot$plot



# We can further include the UMAP coordinates in a combined dataframe with both imputed datasets. Again, we downsample to avoid too many cells cluttering the view (the same indeces are used as before)
combined <- cbind.data.frame(umap_plot$dimred, rbind.data.frame(imputed$dataset1, imputed$dataset2)[sample,])


# Now let us color the plot by the two other overlapping markers which are B cell-relevant

p <- list()
for (m in c('CD20', 'IgD', 'IgM', 'IgG')) {
  
  p[[m]] <- ggplot(combined, aes_string(x = colnames(combined)[1], y = colnames(combined)[2])) +
  geom_point(aes_string(color = m), alpha = 0.3, size = 0.4) +
  theme_bw() + theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_viridis_c() + facet_wrap(~batch)

}

p
#> $CD20

#> 
#> $IgD

#> 
#> $IgM

#> 
#> $IgG


We can compare this to UMAPs for the spectral flow data only (before any batch correction):

# Let us make the UMAP for the corrected set (based on overlapping markers only)
spectral_sample <- (sample[grep('Spectral', names(sample))]-nrow(cytof))
umap_plot <- plot_dimred(spectral[spectral_sample,], name = 'Corrected', type = 'umap', plot = 'CD19', return_coord = T)

# We can view the plot colored by CD19 expression
umap_plot$plot



# We can further include the UMAP coordinates in a combined dataframe with both imputed datasets. Again, we downsample to avoid too many cells cluttering the view (the same indeces are used as before)
combined <- cbind.data.frame(umap_plot$dimred, spectral[spectral_sample,])


# Now let us color the plot by the two other overlapping markers which are B cell-relevant

p <- list()
for (m in c('CD20', 'IgD', 'IgM', 'IgG')) {
  
  p[[m]] <- ggplot(combined, aes_string(x = colnames(combined)[1], y = colnames(combined)[2])) +
  geom_point(aes_string(color = m), alpha = 0.3, size = 0.4) +
  theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + 
  scale_color_viridis_c()

}

p
#> $CD20

#> 
#> $IgD

#> 
#> $IgM

#> 
#> $IgG



The distributions look quite well-transfered to the CyTOF data.